Introduction to tidyverse

(A few remarks and tips before the practical session)

Quick recap from our R bootcamp yesterday


(We were not supposed to finish everything, so no stress.)

The goal was to understand what a “data frame” is actually made of.

Vectors and lists

  • Vectors are collections of values of the same type:
sample   <- c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal")
coverage <- c(18.2,        35.2,       13.4,     44.8)
archaic  <- c(FALSE,       FALSE,      FALSE,    TRUE)
  • Lists are collections of anything:
list("Hello", TRUE, 123)
[[1]]
[1] "Hello"

[[2]]
[1] TRUE

[[3]]
[1] 123
… and that “anything” can also include other vectors!
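
The difference between the two can be sketched in a few lines of base R. The variable names `mixed_vector`, `mixed_list`, and `nested` are made up for this illustration. Combining values of different types with `c()` quietly converts them to one common type, while `list()` keeps each element as it is.

```r
# c() forces all values into one common type (here: character)
mixed_vector <- c("Hello", TRUE, 123)
class(mixed_vector)

# list() keeps the original type of every element
mixed_list <- list("Hello", TRUE, 123)
sapply(mixed_list, class)

# a list element can itself be a whole vector
nested <- list(
  sample   = c("Loschbour", "UstIshim"),
  coverage = c(18.2,        35.2)
)
nested$coverage
```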

An example of such a list of vectors…


From vectors stored as individual variables…


sample   <- c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal")
coverage <- c(18.2,        35.2,       13.4,     44.8)
archaic  <- c(FALSE,       FALSE,      FALSE,    TRUE)

… to those same vectors stored as a (named) list…


list(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  archaic  = c(FALSE,       FALSE,      FALSE,    TRUE)
)

A data frame is just that


A list of vectors…


list(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  archaic  = c(FALSE,       FALSE,      FALSE,    TRUE)
)

… which is just printed as a table.


data.frame(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  archaic  = c(FALSE,       FALSE,      FALSE,    TRUE)
)
           sample coverage archaic
1       Loschbour     18.2   FALSE
2        UstIshim     35.2   FALSE
3          Saqqaq     13.4   FALSE
4 AltaiNeandertal     44.8    TRUE
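
One way to check the "list of vectors" claim yourself is the base R sketch below. It stores the table in a variable called `df` (the name used for the table in the indexing examples) and pokes at it with ordinary list and vector functions.

```r
df <- data.frame(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  archaic  = c(FALSE,       FALSE,      FALSE,    TRUE)
)

is.list(df)        # a data frame is a list under the hood
names(df)          # the list names are the column names
length(df)         # the list length is the number of columns
df$coverage        # each element is an ordinary vector...
df[["coverage"]]   # ...which can also be extracted list-style
mean(df$coverage)  # so the usual vector functions work on columns
```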

Indexing into tables: df[rows, cols]


Indexing by columns (“selecting columns”)

df[, c("sample", "coverage")]
           sample coverage
1       Loschbour     18.2
2        UstIshim     35.2
3          Saqqaq     13.4
4 AltaiNeandertal     44.8
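
A detail worth knowing when selecting columns this way: with a plain `data.frame`, asking for a single column simplifies the result to a vector unless you say otherwise. The sketch below rebuilds the example table as `df` so that it runs on its own.

```r
df <- data.frame(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  archaic  = c(FALSE,       FALSE,      FALSE,    TRUE)
)

# selecting a single column with [rows, cols] gives back a plain vector
df[, "sample"]
is.data.frame(df[, "sample"])

# drop = FALSE keeps the one-column result as a data frame
df[, "sample", drop = FALSE]
is.data.frame(df[, "sample", drop = FALSE])
```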

Indexing by rows (“filtering rows”)

  1. using row numbers:
df[c(2, 3), ]
    sample coverage archaic
2 UstIshim     35.2   FALSE
3   Saqqaq     13.4   FALSE
  2. using TRUE/FALSE for each row:
df[c(FALSE, TRUE, FALSE, TRUE), ]
           sample coverage archaic
2        UstIshim     35.2   FALSE
4 AltaiNeandertal     44.8    TRUE
df[df$coverage > 30, ] # same thing!
           sample coverage archaic
2        UstIshim     35.2   FALSE
4 AltaiNeandertal     44.8    TRUE
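
Row filtering and column selection can also be combined in a single `df[rows, cols]` call. As a preview of the practical session, the sketch below puts the base R version next to one possible dplyr equivalent using `filter()` and `select()`. It assumes the dplyr package is installed, and the names `base_result` and `tidy_result` are made up for this illustration.

```r
library(dplyr)

df <- data.frame(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  archaic  = c(FALSE,       FALSE,      FALSE,    TRUE)
)

# base R: filter rows and select columns in one indexing operation
base_result <- df[df$coverage > 30, c("sample", "coverage")]

# dplyr: the same two steps, spelled out as verbs and chained with a pipe
tidy_result <- df %>%
  filter(coverage > 30) %>%
  select(sample, coverage)

base_result
tidy_result
```

Both versions keep the same samples and columns. The dplyr version reads as a sequence of steps, which tends to pay off as pipelines get longer.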


tidyverse.org



Nine “core” R packages and a “philosophy of data science design” that has inspired many more specialized packages.

link to the paper

What is tidyverse?

The tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share a high-level design philosophy […] so that learning one package makes it easier to learn the next.

The tidyverse encompasses the repeated tasks at the heart of every data science project: data import, tidying, manipulation, visualisation, and programming.

This is still very abstract

In the spirit of hands-on interactivity, we will let “theory” and practice work hand in hand during the exercises.

Further companion study material

https://r4ds.hadley.nz

Let’s talk about our example data

“Western Eurasia witnessed several large-scale human migrations during the Holocene. Here, to investigate the cross-continental effects of these migrations, we shotgun-sequenced 317 genomes—mainly from the Mesolithic and Neolithic periods—from across northern and western Eurasia. These were imputed alongside published data to obtain diploid genotypes from more than 1,600 ancient humans [and about 2,500 present-day humans].”

Our exercises will focus on two MesoNeo data sets:

  • Table of metadata information associated with each sample
  • Genome-wide data set of Identity-by-Descent segments

Why those two data sets?

  1. Best representatives of modern population genetic data
  2. Lots of opportunities to practice tidyverse data processing
  3. Even more opportunities to showcase ggplot2 possibilities

The main reason…

A great example of how to approach totally unfamiliar data!


True story.


Recently, I was given this exact data set. I had to find my way around it, and figure out how to build a project around it.

The exercises are retracing my own data exploration journey!

Let’s get started!

  1. Go to www.bodkan.net/simgen
  2. Click on “Introduction to tidyverse” in the left panel
  • This session will focus on the metadata
  • “More tidyverse practice” will dig into the IBD data set
  3. The “Cheatsheets and handouts” section in the left panel has a single-page version of these slides and the dplyr cheatsheet
  4. Open RStudio and start working!